Audio Generation





ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Zhang, Mengchen, Chen, Qi, Wu, Tong, Liu, Zihan, Lin, Dahua

arXiv.org Artificial Intelligence

Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
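For intuition, the core training signal behind conditional flow matching with a dual-branch predictor can be sketched as follows. The module names, tensor shapes, and shared-backbone wiring are illustrative assumptions, not ViSAudio's actual implementation.

```python
# Minimal sketch: conditional flow matching with two dedicated branches
# for the left/right audio latents; all names and shapes are hypothetical.
import torch
import torch.nn as nn

class DualBranchVelocityNet(nn.Module):
    """Two branches predict velocity fields for the binaural latents,
    sharing a video-conditioned backbone (assumed design)."""
    def __init__(self, latent_dim=64, cond_dim=512, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(2 * latent_dim + cond_dim + 1, hidden), nn.SiLU())
        self.left_head = nn.Linear(hidden, latent_dim)
        self.right_head = nn.Linear(hidden, latent_dim)

    def forward(self, x_t, video_cond, t):
        # x_t: (B, 2*latent_dim) noisy binaural latents, t: (B, 1) flow time.
        h = self.backbone(torch.cat([x_t, video_cond, t], dim=-1))
        return torch.cat([self.left_head(h), self.right_head(h)], dim=-1)

def flow_matching_loss(model, x0, x1, video_cond):
    """Standard conditional flow matching: regress the straight-line
    velocity (x1 - x0) at a random interpolation time t."""
    t = torch.rand(x0.size(0), 1)
    x_t = (1 - t) * x0 + t * x1          # linear interpolation path
    v_target = x1 - x0                    # constant velocity along the path
    v_pred = model(x_t, video_cond, t)
    return ((v_pred - v_target) ** 2).mean()
```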


BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation

Park, Taesoo, Jeong, Mungwi, Park, Mingyu, Kim, Narae, Kim, Junyoung, Kim, Mujung, Yoo, Jisang, Lee, Hoyun, Kim, Sanghoon, Kwon, Soonchul

arXiv.org Artificial Intelligence

This paper presents a tutorial-style survey and implementation guide of BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation. Long-term audio generation is critical for applications in Text-to-Music (TTM) and Text-to-Audio (TTA) systems, where maintaining temporal coherence, prosodic consistency, and harmonic structure over extended durations remains a significant challenge. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture we proposed, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including Multi-Scale Discriminator (MSD) + MED, MSD + MRD, and Multi-Period Discriminator (MPD) + MED + MRD, using objective metrics (Fréchet Audio Distance (FAD), Structural Similarity Index (SSIM), Pearson Correlation Coefficient (PCC), Mel-Cepstral Distortion (MCD)) and subjective evaluations (MOS, SMOS). This paper also provides a comprehensive tutorial on the model architecture, training methodology, and implementation to promote reproducibility. The code and pre-trained models are available at: https://github.com/dinhoitt/BemaGANv2.
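As background for the AMP module, the Snake activation it applies internally has the closed form Snake(x) = x + sin²(αx)/α. The per-channel learnable α and the module wrapper below are illustrative assumptions, not the BemaGANv2 code.

```python
# Sketch of the Snake periodic activation used inside AMP-style blocks;
# the wrapper and parameterization are illustrative only.
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation with a learnable per-channel frequency alpha."""
    def __init__(self, channels, alpha_init=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1, channels, 1), alpha_init))

    def forward(self, x):                      # x: (B, C, T) waveform features
        alpha = self.alpha.clamp(min=1e-4)     # keep the division well-defined
        return x + torch.sin(alpha * x) ** 2 / alpha
```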


TTMBA: Towards Text To Multiple Sources Binaural Audio Generation

He, Yuxuan, Yang, Xiaoran, Pan, Ningning, Huang, Gongping

arXiv.org Artificial Intelligence

Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting essential spatial information for immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the text into a structured format with time and spatial details for each sound event. Next, a pretrained mono audio generation network creates multiple mono audios with varying durations for each event. These mono audios are transformed into binaural audios using a binaural rendering neural network based on spatial data from the LLM. Finally, the binaural audios are arranged by their start times, resulting in multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.
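The final arrangement step of the cascade, placing each rendered binaural event on a shared timeline by its start time, can be sketched roughly as follows; the data layout and function name are hypothetical, not the authors' code.

```python
# Illustrative sketch of arranging per-event binaural clips by start time.
import numpy as np

def arrange_binaural_events(events, sample_rate=16000):
    """events: list of dicts with 'audio' (2, n) binaural array and
    'start' time in seconds, e.g. produced by the rendering stage."""
    end = max(e["start"] + e["audio"].shape[1] / sample_rate for e in events)
    mix = np.zeros((2, int(np.ceil(end * sample_rate))))
    for e in events:
        s = int(round(e["start"] * sample_rate))
        mix[:, s:s + e["audio"].shape[1]] += e["audio"]  # overlap-add events
    return mix
```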


Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation

Zhang, Kang, Pham, Trung X., Lee, Suyeon, Niu, Axi, Senocak, Arda, Chung, Joon Son

arXiv.org Artificial Intelligence

We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: https://github.com/pantheon5100/mgaudio
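A rough sketch of what a dual-role encoder could look like during training, conditioning the flow model while also providing targets for a feature-alignment term, is given below; every name, shape assumption, and the loss weighting are placeholders rather than MGAudio's actual objective.

```python
# Hedged sketch: one encoder output used both as conditioning and as an
# alignment target for the generator's hidden features (assumed design).
import torch
import torch.nn.functional as F

def training_step(flow_model, av_encoder, audio_latents, noise, video, lam=0.5):
    cond = av_encoder(video)                    # role 1: conditioning signal
    t = torch.rand(audio_latents.size(0), 1, 1)
    x_t = (1 - t) * noise + t * audio_latents   # flow-matching interpolant
    v_pred, hidden = flow_model(x_t, cond, t)   # model also exposes features
    fm_loss = F.mse_loss(v_pred, audio_latents - noise)
    # role 2: feature aligner (assumes hidden and cond share the feature dim)
    align_loss = 1 - F.cosine_similarity(hidden, cond.detach(), dim=-1).mean()
    return fm_loss + lam * align_loss
```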


UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

Tian, Jinchuan, Lee, Sang-gil, Kong, Zhifeng, Ghosh, Sreyan, Goel, Arushi, Yang, Chao-Han Huck, Dai, Wenliang, Liu, Zihan, Ye, Hanrong, Watanabe, Shinji, Shoeybi, Mohammad, Catanzaro, Bryan, Valle, Rafael, Ping, Wei

arXiv.org Artificial Intelligence

Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks - an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

Figure 1: Humans need understanding, generation, and reasoning to handle complex tasks, like composing music.

Human auditory intelligence is characterized by two fundamental capabilities: perception (understanding) and production (generation). This duality is not merely conceptual; neuro-scientific evidence reveals a profound synergy between these functions, where impairment in one often corresponds to a deficit in the other (Liberman et al., 1967; Hickok & Poeppel, 2007; Rizzolatti & Craighero, 2004). Furthermore, resolving complex acoustic challenges requires a sophisticated reasoning process that is inherently multimodal (McGurk & MacDonald, 1976; Leman, 2007; Denes & Pinson, 1993; Liberman & Mattingly, 1985).
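A minimal sketch of the discrete-token view described for UALM-Gen, a single decoder-only language model predicting audio codec tokens after a text prompt, is shown below; the vocabulary layout, model sizes, and class names are assumptions for illustration only.

```python
# Minimal sketch of a unified text+audio token LM; sizes are hypothetical.
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB = 32000, 8192          # assumed vocabulary sizes

class UnifiedTokenLM(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, TEXT_VOCAB + AUDIO_VOCAB)

    def forward(self, tokens):                  # tokens: (B, T) text then audio ids
        T = tokens.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.decoder(self.embed(tokens), mask=causal)
        return self.head(h)                     # next-token logits over both vocabs
```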